Unsupervised speaker and expression factorization for multi-speaker expressive synthesis of ebooks
Authors
Abstract
This work aims to improve expressive speech synthesis of ebooks for multiple speakers by using training data from many audiobooks. Audiobooks contain a wide variety of expressive speaking styles that are often impractical to annotate. However, the speaker-expression factorization (SEF) framework, which has proven to be a powerful tool for speaker and expression modelling, usually requires supervised expression labels in the training data. This work presents an unsupervised SEF method that performs SEF on unlabelled training data within the framework of cluster adaptive training (CAT). The proposed method integrates expression clustering and parameter estimation into a single process that maximizes the likelihood of the training data. Experimental results indicate that it outperforms a cascaded system of expression clustering followed by supervised SEF, and significantly improves the expressiveness of the synthetic speech for different speakers.
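As a rough, hypothetical illustration of the idea (not the paper's actual implementation), the Python sketch below alternates between assigning unlabelled utterances to expression clusters and re-estimating CAT-style expression weight vectors so that the training-data likelihood increases; the single-Gaussian model, hard assignments, and all dimensions are simplifying assumptions.

```python
# Toy sketch of unsupervised expression clustering inside a CAT-style model:
# an utterance's mean is an interpolation of cluster mean vectors via an
# expression weight vector, the expression label is latent, and assignment
# and parameter estimation are iterated to increase the data likelihood.
# Everything here (names, sizes, single-Gaussian simplification) is assumed.

import numpy as np

rng = np.random.default_rng(0)

D, P, E = 4, 3, 2                      # feature dim, CAT clusters, expression clusters
N = 200                                # toy utterance-level feature vectors
X = rng.normal(size=(N, D))

M = rng.normal(size=(P, D))            # CAT cluster mean vectors (shared)
W = rng.dirichlet(np.ones(P), size=E)  # one expression weight vector per expression

def log_lik(x, w):
    """Log-likelihood of x under a unit-variance Gaussian with mean w @ M."""
    return -0.5 * np.sum((x - w @ M) ** 2)

for _ in range(20):
    # Assign each utterance to the expression cluster that maximizes its likelihood.
    assign = np.array([np.argmax([log_lik(x, W[e]) for e in range(E)]) for x in X])

    # Re-estimate each expression weight vector by least squares against the
    # utterances currently assigned to it (a crude stand-in for the actual
    # CAT weight re-estimation formulas).
    for e in range(E):
        Xe = X[assign == e]
        if len(Xe):
            W[e], *_ = np.linalg.lstsq(M.T, Xe.mean(axis=0), rcond=None)

print("final expression weights:\n", W)
```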
Similar resources
Towards an Unsupervised Speaking Style Voice Building Framework: Multi-Style Speaker Diarization
Current text-to-speech systems are developed using studio-recorded speech in a neutral style or based on acted emotions. However, the proliferation of media sharing sites makes it possible to develop a new generation of speech-based systems that can cope with spontaneous and styled speech. This paper proposes an architecture to deal with realistic recordings and carries out some experiments on uns...
Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors
The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal compo...
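A minimal sketch of the d-vector idea described above, assuming (since the abstract is truncated) that PCA is applied to averaged speaker-discriminative activations and that the resulting vector is concatenated with the frame-level linguistic features; all names and dimensions are illustrative, not the paper's settings.

```python
# Hypothetical sketch: derive per-speaker d-vectors via PCA over averaged
# activations, then concatenate them with linguistic features as the input
# of a multi-speaker DNN-TTS. Activation source and sizes are assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_speakers, frames_per_spk, act_dim, d_dim = 10, 500, 256, 8

# Stand-in for hidden-layer activations of a speaker-discriminative DNN.
acts = {s: rng.normal(size=(frames_per_spk, act_dim)) for s in range(n_speakers)}

# One averaged activation vector per training speaker.
spk_means = np.stack([a.mean(axis=0) for a in acts.values()])   # (n_speakers, act_dim)

# PCA via SVD of the mean-centred speaker representations.
centred = spk_means - spk_means.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
d_vectors = centred @ vt[:d_dim].T                              # (n_speakers, d_dim)

def tts_input(linguistic_feats, speaker_id):
    """Concatenate linguistic features with the speaker's d-vector, giving
    the joint input to the multi-speaker DNN-TTS."""
    d = np.broadcast_to(d_vectors[speaker_id], (len(linguistic_feats), d_dim))
    return np.concatenate([linguistic_feats, d], axis=1)

frames = rng.normal(size=(100, 300))          # dummy frame-level linguistic features
print(tts_input(frames, speaker_id=3).shape)  # (100, 308)
```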
A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis
Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. In order to develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used for training new voices with a whole new range of speaking styles. However, for processing this more spontaneous material, the new systems...
Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis
In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user’s spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user’s voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer...
Ensemble speaker modeling using speaker adaptive training deep neural network for speaker adaptation
In this paper, we introduce ensemble speaker modeling using a speaker adaptive training (SAT) deep neural network (SAT-DNN). We first train a speaker-independent DNN (SI-DNN) acoustic model as a universal speaker model (USM). Based on the USM, a SAT-DNN is used to obtain a set of speaker-dependent models by assuming that all layers except one speaker-dependent (SD) layer are shared amon...
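A small, hypothetical PyTorch sketch of the shared/speaker-dependent structure described above: every layer is shared across speakers except one speaker-dependent (SD) layer instantiated per speaker on top of the speaker-independent model. Layer sizes and the position of the SD layer are assumptions, not the paper's configuration.

```python
# Hypothetical sketch of a SAT-DNN with one speaker-dependent layer;
# all other parameters are shared (the speaker-independent USM part).

import torch
import torch.nn as nn

class SATDNN(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, speaker_ids):
        super().__init__()
        # Shared (speaker-independent) layers.
        self.shared_in = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.shared_out = nn.Linear(hid_dim, out_dim)
        # One speaker-dependent layer per speaker.
        self.sd_layers = nn.ModuleDict(
            {str(s): nn.Linear(hid_dim, hid_dim) for s in speaker_ids}
        )

    def forward(self, x, speaker_id):
        h = self.shared_in(x)
        h = torch.relu(self.sd_layers[str(speaker_id)](h))
        return self.shared_out(h)

model = SATDNN(in_dim=40, hid_dim=128, out_dim=10, speaker_ids=[0, 1, 2])
print(model(torch.randn(8, 40), speaker_id=1).shape)  # torch.Size([8, 10])
```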